1. INTRODUCTION

We are grateful to our four discussants for their agreement with and
contributions to the central points in our article (Imai et al. (2009b)).
As Zhang and Small (2009) write, "[our article] present[s] convincing
evidence that the matched pair design, when accompanied with good
inference methods, is more powerful than the unmatched pair design and
should be used routinely." And, as they put it, Hill and Scott (2009) "do
not take issue with [our article's] provocative assertion that one should
pair-match in cluster randomized trials 'whenever feasible.'" Whether
denominated in terms of research dollars saved, or additional knowledge
learned for the same expenditure, the advantages in any one research
project of switching standard experimental protocols from complete
randomization to a matched pair design (along with the accompanying new
statistical methods) can be considerable.

In the two sections that follow, we address our discussants' points
regarding ways to pair clusters (Section 2) and the costs and benefits of
design- and model-based estimation (Section 3). But first we offer a sense
of how many experiments across fields of inquiry can be improved in the
ways we discuss in our article. We did this by collecting data from the
last 106 cluster-randomized experiments published in 27 leading journals
in medicine, public health, political science, economics, and education.
We then counted how many experiments used complete randomization,
blocking (on some but not all pre-treatment information), or
pair-matching, which respectively exploit none, some and all of the
available pre-randomization covariate information. Table 1 gives a
summary. Overall, only 19% of cluster-randomized experiments used
pair-matching, which means that 81% left at least some pre-randomization
covariate information on the table. Indeed, almost 60% of these
experiments used complete randomization and so took no advantage of the
information in pre-treatment covariates. The table conveys that there is
some variation in these figures across fields, but in no field is the use
of pair matching in cluster-randomized designs very high, and it never
occurs in even as many as 30% of published experiments. Administrative
constraints may have prevented some of these experiments from being pair
matched, but as using this information involves no modeling risks, the
opportunities for improving experimental research across many fields of
inquiry seem quite substantial.

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in Statistical Science, 2009, Vol. 24,
No. 1, 65-72. This reprint differs from the original in pagination and
typographic detail.

Table 1
Percent of recent cluster-randomized experiments in each of four research
fields using unblocked, blocked (on a subset of available pre-treatment
covariates) or pair-matched designs

                              Amount of pre-randomization design information used
Field                         None (Unblocked)   Some (Blocked)   All (Pair-matched)     N
Medicine and public health         56.2%             20.5%              23.3%           73
Political science                  71.4              23.8                4.8            21
Economics                          42.9              28.6               28.6             7
Education                          80.0              20.0                0.0             5
Total                              59.4              21.7               18.9           106

Row totals may not add to 100% due to rounding. For details on these data, see the Appendix.

2. HOW TO CONSTRUCT MATCHED PAIRS

Zhang and Small (2009) offer some creative ideas on how to construct
matched pairs based on minimizing the total (i.e., across pairs)
Mahalanobis-based distance metric, which is referred to as an "optimal"
method. This procedure can be useful in many situations, and will usually
be superior to Mahalanobis-based matching methods that do not consider
imbalances for all pairs simultaneously.

This technique, of course, is not always appropriate. For example, the
procedure assumes that Mahalanobis distances make sense for the input
data, which means that the variance matrix which scales the distances is
known or can be estimated, and that the input variables are close to
normal. Perhaps even more importantly, the procedure maps all the
distances to a scalar to measure balance; this assumes that the researcher
is willing to reduce balance within pairs for some pre-treatment variables
in order to achieve a larger improvement for other variables. However, if
the set of variables having its balance reduced has a bigger impact on the
outcome than the other set, then the trade-off implied by the distance
metric would be ill advised. One way to avoid these trade-offs is to use a
matching method without a scalar balance metric, such as "coarsened exact
matching" which guarantees that the maximum possible imbalance for each
variable is set by ex ante user choice (Iacus et al. (2008)).
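
As a rough illustration of matching without a scalar balance metric, the sketch below coarsens each cluster-level covariate into bins of a width chosen in advance and pairs clusters only within the same coarsened stratum, so the within-pair difference on every variable is bounded by that variable's bin width. The cluster data, the bin widths, and the function name `coarsened_pairs` are hypothetical, and this is a simplified reading of the idea, not the algorithm of Iacus et al. (2008).

```python
import math
from collections import defaultdict


def coarsened_pairs(covariates, widths):
    """Pair units that fall in the same coarsened stratum.

    covariates: dict mapping unit id -> tuple of covariate values
    widths: tuple of bin widths, one per covariate (chosen ex ante)
    Returns (pairs, unmatched).
    """
    strata = defaultdict(list)
    for unit, x in sorted(covariates.items()):
        # The stratum signature is the tuple of bin indices, one per covariate.
        signature = tuple(math.floor(v / w) for v, w in zip(x, widths))
        strata[signature].append(unit)

    pairs, unmatched = [], []
    for members in strata.values():
        # Pair consecutive members; an odd member out stays unmatched.
        for a, b in zip(members[0::2], members[1::2]):
            pairs.append((a, b))
        if len(members) % 2 == 1:
            unmatched.append(members[-1])
    return pairs, unmatched


# Hypothetical clusters: (poverty rate in percent, population in thousands).
clusters = {
    "A": (12.0, 41.0), "B": (14.5, 44.0), "C": (31.0, 12.0),
    "D": (33.0, 18.5), "E": (13.0, 95.0), "F": (11.0, 47.5),
}
widths = (5.0, 10.0)
pairs, unmatched = coarsened_pairs(clusters, widths)
print(pairs, unmatched)
```

Clusters left without a partner in their stratum stay unmatched here; in practice a researcher might widen the bins or handle such clusters separately, which is a design choice this sketch does not settle.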

Our qualifications here are minor, of course, as 
most versions of pair matching with a good choice 
of pre-treatment variables would normally represent 
a tremendous improvement over a complete random- 
ization design with respect to bias, power, efficiency, 
and robustness. And Zhang and Small's point is 
clearly correct that one can often do better by con- 
sidering balance on all pairs simultaneously in the 
context of scalar distance-based balancing. 

Finally, we note that constructing matched pairs 
in experimental work is similar to the problem of 
matching in observational causal inference. The tech- 
nologies available for that problem can in some cases 
be adapted for use in matching pre-randomization 
(Greevy et al. (2004); Ho et al. (2007)). A large num- 
ber of these methods, including optimal matching, 
are collected in the MatchIt software (Ho et al. (2009)).
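
To make the contrast between "optimal" and pair-at-a-time matching concrete, here is a minimal sketch that compares an exhaustive search for the pairing with the smallest total Mahalanobis distance against a greedy rule that repeatedly pairs the two closest remaining clusters. The six clusters and their two covariates are made up, all function names are hypothetical, and the exhaustive search is only practical for a handful of clusters; it stands in for, and is not, the algorithm discussed by Zhang and Small (2009).

```python
from itertools import combinations


def inverse_covariance_2d(points):
    """Inverse of the 2x2 sample covariance matrix of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    return ((syy / det, -sxy / det), (-sxy / det, sxx / det))


def mahalanobis(a, b, inv):
    """Mahalanobis distance between two 2-d points given an inverse covariance."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return q ** 0.5


def optimal_pairs(units, dist):
    """Exhaustively find the pairing with minimum total distance (even count only)."""
    if not units:
        return 0.0, []
    first, rest = units[0], units[1:]
    best_cost, best_pairs = float("inf"), None
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        cost, sub_pairs = optimal_pairs(remaining, dist)
        cost += dist[frozenset((first, partner))]
        if cost < best_cost:
            best_cost, best_pairs = cost, [(first, partner)] + sub_pairs
    return best_cost, best_pairs


def greedy_pairs(units, dist):
    """Repeatedly pair the two closest remaining units."""
    remaining, pairs, total = list(units), [], 0.0
    while remaining:
        a, b = min(combinations(remaining, 2), key=lambda ab: dist[frozenset(ab)])
        pairs.append((a, b))
        total += dist[frozenset((a, b))]
        remaining.remove(a)
        remaining.remove(b)
    return total, pairs


# Hypothetical cluster-level covariates (two per cluster).
data = {"A": (0, 0), "B": (3, 1), "C": (4, 0), "D": (7, 1), "E": (8, 3), "F": (1, 4)}
inv = inverse_covariance_2d(list(data.values()))
dist = {frozenset((a, b)): mahalanobis(data[a], data[b], inv)
        for a, b in combinations(data, 2)}

opt_cost, opt_pairs = optimal_pairs(list(data), dist)
greedy_cost, greedy_result = greedy_pairs(list(data), dist)
print("optimal:", opt_pairs, round(opt_cost, 3))
print("greedy: ", greedy_result, round(greedy_cost, 3))
```

By construction the exhaustive total can never exceed the greedy total; how large the gap is depends entirely on the data.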



3. MODEL VS. DESIGN-BASED ESTIMATORS 
FOR MATCHED PAIR EXPERIMENTS 

Hill and Scott's (2009) informative commentary 
raises the venerable contrast between model-free and 
model-based estimators, to which we offer four points.
First, we agree that models are sometimes warranted, 
valuable, or unavoidable. For example, our encom- 
passing approach (Section 4.5 in our article) is a 
hierarchical model that adds modeling assumptions 
in order to potentially gain greater efficiency, al- 
though at the risk of greater bias; in our applica- 
tion, we multiply impute missing data with a model 
(Honaker and King (2009)); and the article on the 
design of our experiment proposed modeling to cor- 
rect for certain types of possible experimental fail- 
ures (which, as it turned out, did not materialize) 
(King et al. (2007)). 

Second, models are sometimes useful in providing 
helpful intuition. For example, Hill and Scott (2009) 
write "In some ways, the IKN framework is actually 
quite similar to the multilevel framework that allows 
for variation in treatment effects across pairs." In 
fact, we prove in Section 3.2 that Hill and Scott's 
model without covariates is identical to our design- 
based estimator when the within-pair cluster sizes 
are the same. The two approaches only diverge in 
meaningful ways when covariates are included. 

Third, randomization along with a design-based 
(i.e., model-free) estimator has benefits no model
can match: instead of inferences that are somewhat 
robust to some types of model misspecification in 
some circumstances, design-based estimators are en- 
tirely invariant to any modeling or ignorability as- 
sumptions. This is the unique and extraordinary 
contribution of the idea of randomization to causal 
inference, when used with appropriate methods. In contrast, in even
pristine experimental data, using the wrong model can generate bias,
inefficiency, higher mean square error, and incorrect confidence interval
coverage, especially in small samples (Freedman (2008)). While modeling
can improve efficiency under some circumstances like the simulations of
Hill and Scott, jettisoning the advantage of randomization by introducing
unnecessary modeling assumptions is not something that should be done
routinely. Although researchers who have put in the extra effort and
expense, and often obtained special Institutional Review Board approval,
to implement a randomized study may in some situations agree to sacrifice
the guarantee of unbiasedness for a chance at lower variances, such a
choice comes at substantial risk. It is no wonder that the vast majority
of experimentalists, recognizing randomization as the greatest strength of
their research design, abhor unnecessary assumptions and avoid model-based
estimation in most situations. See Imai et al. (2008); Imbens (2009).

Fourth, unnecessary modeling can introduce more 
severe biases when applied in the context of ex- 
perimental failures common in real world applica- 
tions. An important example of this issue occurs 
when controlling for a covariate that is influenced 
by the treatment variable, which can result in post- 
treatment bias. For example, in community-based 
experimental settings, covariates measured in base- 
line surveys just before the introduction of treat- 
ment may capture behavioral changes arising from 
subjects' anticipation of being in the control group 
and from other experimenter and observer interven- 
tion that may differ between treatment and con- 
trol clusters, a common situation in observational 
studies (Ashenfelter (1978)). Indeed, the data gen- 
eration process for the Monte Carlo simulations in 
Hill and Scott (2009) injects this real world post- 
treatment variable problem into the data (see Sec- 
tion 3.1). We show that in this situation model- 
based estimates are not robust to small changes in 
the simulation setup. 

Finally, the most important risk in resorting to 
unnecessary modeling assumptions is the introduc- 
tion of model dependence (King and Zeng (2006); 
Ho et al. (2007)). Indeed, we show analytically in 
Section 3.3 and via simulation in Section 3.4 that 
model-based inferences in experimental data can be 
highly model dependent. We then offer two simulated examples. In one,
changing a linear modeling assumption to a nonlinear modeling assumption
produces large biases and incorrect confidence interval coverage, in a way
that model fit tests do not detect. In the other, we show that adjusting
for a pre-treatment but incorrect covariate can produce inefficient
estimates and lead to confidence intervals with inaccurate coverage when
compared to the design-based estimator.

3.1 The Data Generation Process 

We begin with Hill and Scott's (2009) data generating process. For
individual i, in cluster j (j = 1,2), and pair k (k = 1,...,K), we
generate individual level potential outcomes as
$Y_{ijk}(t) \sim \mathcal{N}(\overline{Y}_{jk}(t), \sigma_Y^2)$,
where t = 1 is treated, t = 0 is control. Under their data generating
process,

(1)  $\overline{Y}_{1k}(0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$,

(2)  $\overline{Y}_{2k}(0) = \overline{Y}_{1k}(0) + \delta_k, \quad \delta_k \sim \mathcal{N}(0, \sigma_\delta^2)$,

(3)  $\overline{Y}_{1k}(1) = \overline{Y}_{1k}(0) + \tau_{1k}$,

(4)  $\overline{Y}_{2k}(1) = \overline{Y}_{2k}(0) + \tau_{2k}$,

where $\mu_0$ is the mean cluster-level potential outcome under control,
and $\sigma_\delta$ represents the standard deviation of within-pair
imbalance. Furthermore, Hill and Scott set the causal effect (the
difference in the potential outcomes, averaged over all individuals within
a cluster) as $\tau_{jk} = 30/\overline{Y}_{jk}(0)$. This specification
implies that $\tau_{jk}$ does not have finite moments and thus the
population average treatment effect does not exist.
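
The reason the moments fail to exist can be sketched in one line. Write $\overline{Y}_{jk}(0)$ for the cluster-level mean potential outcome under control, so that $\tau_{jk} = 30/\overline{Y}_{jk}(0)$, and let $f$ denote the normal density of $\overline{Y}_{jk}(0)$ (the symbols $f$ and $c$ are introduced here only for this illustration). Because a normal density is strictly positive in a neighborhood of zero, for any constant $c > 0$:

```latex
E\lvert \tau_{jk} \rvert
  = \int_{-\infty}^{\infty} \frac{30}{\lvert y \rvert}\, f(y)\, dy
  \;\ge\; 30 \Bigl( \min_{\lvert y \rvert \le c} f(y) \Bigr)
          \int_{-c}^{c} \frac{dy}{\lvert y \rvert}
  = \infty .
```

The integral of $1/\lvert y \rvert$ diverges at the origin, so not even the first absolute moment is finite, however far the mean $\mu_0$ sits from zero.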

Hill and Scott further assume that the cluster is treated (t = 1) if
j = 2 and not (t = 0) if j = 1. This means that the distributions of
potential outcomes are different between the treatment and control groups,
which indicates that this is a simulation where the randomization failed:
Although the means of the potential outcomes are the same, their variances
are different unless $\sigma_\delta = 0$.
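
A quick simulation can illustrate the point about equal means and unequal variances. The sketch below draws cluster-level means under control for the first cluster in each pair and adds within-pair imbalance to obtain the second (always treated) cluster. All parameter values and variable names are hypothetical and chosen only for illustration; they are not the values used by Hill and Scott.

```python
import random
import statistics

random.seed(2009)

# Hypothetical parameter values, chosen only for illustration.
MU_0 = 10.0         # mean cluster-level potential outcome under control
SIGMA_0 = 1.0       # sd of the first cluster's mean in each pair
SIGMA_DELTA = 1.0   # sd of the added within-pair imbalance
K = 50_000          # number of pairs (large, to keep sampling noise small)

# Cluster j = 1 (always control) and cluster j = 2 (always treated).
y1 = [random.gauss(MU_0, SIGMA_0) for _ in range(K)]
y2 = [y + random.gauss(0.0, SIGMA_DELTA) for y in y1]

mean_1, mean_2 = statistics.fmean(y1), statistics.fmean(y2)
var_1, var_2 = statistics.variance(y1), statistics.variance(y2)

print(f"means:     {mean_1:.3f} vs {mean_2:.3f}")
print(f"variances: {var_1:.3f} vs {var_2:.3f}")
```

Under these assumed values the second cluster's variance is the sum of the two variance components, so with a fixed assignment rule the treated and control groups differ in distribution even before any treatment effect is added.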

Hill and Scott then generate their cluster-level covariate as

(5)  $X_{jk} = X_{jk}(T_{jk}) = \overline{Y}_{jk}(0) + \zeta_{jk} = \overline{Y}_{1k}(0) + T_{jk}\delta_k + \zeta_{jk}$,

where $T_{jk}$ is the cluster-level treatment indicator and
$\zeta_{jk} \sim \mathcal{N}(0, \sigma_\zeta^2)$. The specification implies
that $X_{jk}$ is a post-treatment covariate since the distribution of
$X_{jk}$ is a consequence of treatment and, in particular, different
between the treatment and control groups. Again, although the mean of
$X_{jk}$ is the same (and equal to $\mu_0$), its variance is different
unless $\sigma_\delta = 0$. Note especially that all random deviations
from the normal draw of $\overline{Y}_{1k}(0)$ are reflected in $X_{jk}$,
which accounts for its fit to the data.

In their simulations, the results from which we 
replicate exactly, Hill and Scott sample cluster sizes 
from a multinomial distribution with a mean of 50. 
In the simulations we present here, we similarly sam- 
ple cluster sizes from a multinomial distribution, but 
vary the average cluster size to represent other typ- 
ical cluster-randomized experimental settings that 
commonly employ fewer clusters than were used in 
the Seguro Popular evaluation. 

3.2 The Model without Covariates: Equivalent 
to the Design-Based Estimator 

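Hill and Scott propose the following model without covariates, which we
show here is equivalent to our design-based estimator when the within-pair
cluster sizes are the same:

(6)  $Y_{ijk} = \tau_k T_{jk} + \alpha_k + \epsilon_{ijk}$, where $\epsilon_{ijk} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2)$,

(7)  $\begin{pmatrix} \tau_k \\ \alpha_k \end{pmatrix} \sim \mathcal{N}\left[ \begin{pmatrix} \tau_0 \\ \alpha_0 \end{pmatrix}, \begin{pmatrix} \sigma_\tau^2 & \sigma_{\alpha\tau} \\ \sigma_{\alpha\tau} & \sigma_\alpha^2 \end{pmatrix} \right]$,

where $\tau_k$ is the pair-specific average treatment effect and
$\epsilon_{ijk} \perp\!\!\!\perp (\tau_k, \alpha_k)$. We rewrite this model as

$Y_{ijk} \mid T_{jk} \sim \mathcal{N}[\tau_0 T_{jk} + \alpha_0,\; \sigma_\epsilon^2 + T_{jk}(\sigma_\tau^2 + 2\sigma_{\alpha\tau}) + \sigma_\alpha^2]$.

Then, the maximum likelihood estimate of $\tau_0$ is

(8)  $\hat{\tau}_0 = \frac{\sum_{k=1}^{K} \sum_{j=1}^{2} \sum_{i=1}^{n_{jk}} T_{jk} Y_{ijk}}{\sum_{k=1}^{K} \sum_{j=1}^{2} T_{jk} n_{jk}}$

(9)  $\qquad - \frac{\sum_{k=1}^{K} \sum_{j=1}^{2} \sum_{i=1}^{n_{jk}} (1 - T_{jk}) Y_{ijk}}{\sum_{k=1}^{K} \sum_{j=1}^{2} (1 - T_{jk}) n_{jk}}$,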



which is identical to our design-based estimator when 
the within-pair cluster sizes are the same. In simu- 
lations, we find that this estimator is quite similar 
to the design-based estimator even when the within- 
pair cluster sizes are different. 
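
The equivalence can be checked numerically on toy data. The sketch below compares the pooled difference in means of equations (8) and (9) with a pair-size-weighted average of within-pair differences in cluster means. The latter is an assumed form of the design-based estimator; it is not restated in this section, so treat it as an assumption. The outcome values and function names are made up for illustration.

```python
def pooled_difference_in_means(pairs):
    """Equations (8)-(9): pooled treated mean minus pooled control mean."""
    treated = [y for p in pairs for y in p["treated"]]
    control = [y for p in pairs for y in p["control"]]
    return sum(treated) / len(treated) - sum(control) / len(control)


def weighted_pair_differences(pairs):
    """Assumed design-based form: pair-size-weighted mean of within-pair differences."""
    num = den = 0.0
    for p in pairs:
        weight = len(p["treated"]) + len(p["control"])
        diff = (sum(p["treated"]) / len(p["treated"])
                - sum(p["control"]) / len(p["control"]))
        num += weight * diff
        den += weight
    return num / den


# Hypothetical outcomes; within each pair both clusters have the same size.
equal_sizes = [
    {"treated": [5.0, 7.0, 6.0], "control": [4.0, 4.0, 7.0]},
    {"treated": [9.0, 11.0], "control": [8.0, 6.0]},
]
# Same data with one control observation dropped, so sizes differ in pair 1.
unequal_sizes = [
    {"treated": [5.0, 7.0, 6.0], "control": [4.0, 4.0]},
    {"treated": [9.0, 11.0], "control": [8.0, 6.0]},
]

for label, pairs in (("equal", equal_sizes), ("unequal", unequal_sizes)):
    print(label, pooled_difference_in_means(pairs), weighted_pair_differences(pairs))
```

With equal within-pair sizes the two quantities coincide up to floating-point error; with unequal sizes they can separate, and in a toy example this small the gap says nothing about how close they are in realistic simulations.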

3.3 The Model with Covariates 

Consider a generalized version of the model in Sec- 
tion 3.2 with a covariate: 



(10)  $Y_{ijk} = \alpha_k + g(X_{jk})\beta + \tau_k T_{jk} + \epsilon_{ijk}$,

where $\epsilon_{ijk} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\epsilon^2)$,
$g(\cdot)$ is an assumed function specified as part of the model, and
$(\tau_k, \alpha_k)$ is distributed as the bivariate normal in equation
(7). Hill and Scott (2009) consider a special case of this model with a
post-treatment covariate [see equation (5)], such that

(11)  $g(X_{jk}) = g(X_{jk}(T_{jk})) = g(\overline{Y}_{jk}(0) + \zeta_{jk}) = g(\overline{Y}_{1k}(0) + T_{jk}\delta_k + \zeta_{jk})$,

and with the linear functional form restriction, g(x) = x.

If we estimate this general model using Hill and Scott's post-treatment
covariate, the crucial question is what quantity is being estimated. We
denote this estimand as $\tau^*$ and characterize the difference between it
and the average treatment effect (under this model) as follows (see
Rosenbaum (1984)):

(12)  $\tau^* - E(\tau_k) = E\{E(Y_{ijk} \mid T_{jk} = 1, X_{jk}) - E(Y_{ijk} \mid T_{jk} = 0, X_{jk})\} - E(\tau_k)$

(13)  $\qquad = E\{E(Y_{ijk} \mid T_{jk} = 1, X_{jk}(1)) - E(Y_{ijk} \mid T_{jk} = 0, X_{jk}(0))\} - E(\tau_k)$

(14)  $\qquad = E\{g(\overline{Y}_{1k}(0) + \delta_k + \zeta_{jk}) - g(\overline{Y}_{1k}(0) + \zeta_{jk})\}\beta$.

The model dependence of Hill and Scott's specification can be seen in the
last line: When g(x) = x as they assume, then the last line equals 0 and
the discrepancy between the estimand and the quantity of interest
vanishes. However, if $g(\cdot)$ is not a linear function, then the
quantity being estimated by this model, $\tau^*$, does not in general equal
the average treatment effect, that is, $E(\tau_k)$. The degree of
discrepancy thus solely depends on the functional form assumption, which
of course is a clear case of model dependence.
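
A worked case may help fix ideas. Take the quadratic choice $g(x) = x^2$ (an example introduced here, not one used in the simulations), and assume that the imbalance term $\delta_k$ has mean zero and variance $\sigma_\delta^2$ and is independent of both the cluster mean $\overline{Y}_{1k}(0)$ and the noise term $\zeta_{jk}$; the independence is an assumption of this sketch. Expanding the last line of the derivation then gives:

```latex
\begin{align*}
\tau^* - E(\tau_k)
  &= E\bigl\{ (\overline{Y}_{1k}(0) + \delta_k + \zeta_{jk})^2
            - (\overline{Y}_{1k}(0) + \zeta_{jk})^2 \bigr\}\,\beta \\
  &= E\bigl\{ 2\,\delta_k\,(\overline{Y}_{1k}(0) + \zeta_{jk}) + \delta_k^2 \bigr\}\,\beta \\
  &= \sigma_\delta^2\,\beta .
\end{align*}
```

The cross term drops out because $\delta_k$ is mean zero and independent of the other terms, leaving a discrepancy that grows with the variance of the added imbalance. Under the linear choice $g(x) = x$ the same expansion leaves only $E(\delta_k)\beta = 0$.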

3.4 Simulations 

We perform two simulations which are based on, 
but not identical to, Hill and Scott's simulation setup. 
Our goal in this section is to offer a more general 
illustration of model dependence than in Hill and 
Scott's setup. To do so, in both simulations, we 
correct the randomization failure by properly randomizing the treatment
and address the divide-by-zero problem by using a left-truncated normal
distribution (instead of a normal distribution without truncation) with a
truncation point of 2. We then examine the consequence of adjusting for
the post-treatment variable (the first simulation) as well as
the pre-treatment variable (the second simulation). 
We run our simulation for 2,000 iterations; details 
appear in our replication data archive (Imai et al. 
(2009a)). 
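
The effect of the truncation can be sketched with a simple rejection sampler. The code assumes that the truncated quantity is the cluster-level mean under control, the denominator of the treatment effect 30 divided by that mean; the mean and standard deviation used below are hypothetical, and only the truncation point of 2 comes from the text. The function name is also made up.

```python
import random


def left_truncated_normal(mu, sigma, lower, rng):
    """Draw from N(mu, sigma^2) conditioned on the value being >= lower."""
    while True:
        draw = rng.gauss(mu, sigma)
        if draw >= lower:  # reject draws below the truncation point
            return draw


rng = random.Random(24)
# Hypothetical mean and sd; only the truncation point of 2 comes from the text.
cluster_means = [left_truncated_normal(5.0, 3.0, 2.0, rng) for _ in range(10_000)]
effects = [30.0 / y for y in cluster_means]

print(min(cluster_means), max(effects))
```

Because every draw is at least 2, every simulated effect lies between 0 and 15, so all moments of the effect distribution exist and an average treatment effect is well defined.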

For the first simulation, we set
$g(X_{jk}) = \log(\overline{Y}_{1k}(0)) + \eta_{jk}$ if $T_{jk} = 0$ and
$g(X_{jk}) = \exp(\overline{Y}_{2k}(0)) + \eta_{jk}$ if $T_{jk} = 1$, where
$\eta_{jk} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 2)$, the values of
which are fixed over simulations. The results of this first simulation
appear in the left column of Figure 1. The horizontal axis for each graph
is the standard deviation of added (post-treatment) imbalance, which is
denoted by $\sigma_\delta$ (see Section 3.1). We present results for our
design-based estimator (solid line), the model-based estimator with a
covariate (dashed line), and, to evaluate whether it might be possible to
test one's way out of the problems, a pre-test model-based estimator using
the likelihood ratio test to decide for each simulation whether to include
the covariate, as done in Hill and Scott's simulations (dotted line).

The estimated bias presented in the top left graph 
shows that while our design-based estimator is ap- 
proximately unbiased, the model-based and pre-test 
estimators are severely biased for a wide range of 
the simulations. The variance of each of the esti- 
mators is relatively small, and so with large bias 
the root mean square error is mostly irrelevant, but 
it too indicates (in the middle left graph) that the 
design-based estimator is superior. The estimated 
coverage probability of the 95% confidence interval, 
displayed for the two estimators in the bottom left 
graph, stays approximately at the nominal level for 
our design-based estimator but is far from valid over 
much of the range for the model-based and pre-test 
approaches. 
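
For readers who want to reproduce summaries of this kind, the three quantities can be computed from simulation output as in the sketch below. It assumes each replication yields a point estimate and a standard error and that intervals are formed as the estimate plus or minus 1.96 standard errors; the intervals in the simulations reported here may be constructed differently. The four replications and the function name `summarize` are made up.

```python
import math


def summarize(estimates, std_errors, truth, z=1.96):
    """Monte Carlo bias, RMSE, and coverage of normal-approximation intervals."""
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if abs(e - truth) <= z * s)
    return bias, rmse, covered / n


# Four made-up replications with a true effect of 1.0.
estimates = [0.8, 1.1, 1.5, 1.0]
std_errors = [0.2, 0.1, 0.2, 0.1]
bias, rmse, coverage = summarize(estimates, std_errors, truth=1.0)
print(bias, rmse, coverage)
```

In this toy input the third replication misses the truth by more than 1.96 standard errors, so three of the four intervals cover and the estimated coverage is 0.75.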

For our second set of simulations, we examine the 
consequence of adjusting for the pre-treatment co- 
variate using a model-based approach. We adopt 
a data generating process similar to that of Hill 
and Scott's simulations, but use a different specification for the
pre-treatment cluster-level covariate:
$X_{1k} = \log(\overline{Y}_{1k}(0)) + \eta_{jk}$ and
$X_{2k} = \exp(\overline{Y}_{2k}(0)) + \eta_{jk}$, where
$\eta_{jk} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 2)$. In addition, because many
community-based cluster-randomized experiments in 
public health and education are forced to use as few 
as 5 to 10 pairs, we reduced the sample size to twenty 
clusters of average size 15. 



The results from this second simulation appear in the right column of
Figure 1, again for design-based (solid line), model-based (dashed) and
pre-test (dotted) estimators. As expected, the top right
graph shows that all three estimators are approx- 
imately unbiased because we no longer adjust for 
post-treatment covariates (although the bias is slightly 
smaller for the pre-test and design-based estima- 
tors than the model-based approach). The middle 
right graph shows that the design-based estimator
has uniformly lower root mean square error than 
the other two approaches. The bottom right graph 
shows that our design-based approach produces ap- 
proximately correct coverage across varying levels 
of within-pair imbalance, while the model-based and 
pre-test estimators produce confidence intervals that 
are somewhat too narrow. 

3.5 How to Use Pre-Treatment Information

Introducing models into randomized experiments 
can improve estimation or make it worse. Hill and 
Scott have given examples where specific models 
out-perform design-based estimators. With similar 
models and data generation processes we show here 
that models can also under-perform relative to design- 
based estimators. Although diagnostic tests can some- 
times help an analyst choose the correct strategy 
from the data, the differences can be subtle and in 
many situations, such as the ones we illustrate here, 
standard tests cannot detect model failures. None 
of these points are new, but it is useful to have ex- 
amples of each issue laid out with the clarity this 
Symposium has made possible. 

Given these issues, our recommendation, which our discussants appear to
share, is to avoid modeling choices by pair matching on all available
covariates prior to randomization as part of the design of
cluster-randomized experiments. This allows researchers to obtain the
efficiency gains of modeling without risking the statistical advantages of
random assignment. If
exact pair matching is possible, then model dependence is eliminated and
the difference between many model-based and design-based estimators will
vanish. When exact matching is not possible, then the user may choose to
introduce a model if its expected efficiency gains are judged to outweigh
the loss of the guaranteed unbiasedness due to randomization. In
many cases, such as with noncompliance and miss- 
ing data, models may be unavoidable. 
[Figure 1: six panels in two columns. Left column panels are titled "Randomized, Post-Treatment X"; right column panels are titled "Randomized Experiment". Rows show Bias, RMSE, and Coverage Probability of the 95% Conf. Interval. The horizontal axis of every panel is the Standard Deviation of Added Imbalance. Lines: Design-Based, Model-Based, Pre-Test.]

Fig. 1. Model dependence. For the design-based (solid), model-based (dashed) and pre-test (dotted) estimators, we present
the bias (top row), root mean square error (middle row) and confidence interval coverage (bottom row). The left column
demonstrates model dependence from the simulation in Hill and Scott by changing only the model to add nonlinearity; the
right column gives an example where even under proper randomization inclusion of a covariate can worsen RMSE and the
coverage probability.
4. CONCLUDING REMARKS

We developed the arguments, methods and evi- 
dence for our article in the context of a large ran- 
domized study of the Mexican universal health care 
system, Seguro Popular (King et al. (2007), 2009). 
Using the matched pair design for cluster randomization and our
design-based statistical methods meant that we were able to save a great
deal of money and produce far more informative estimates of causal effects
without risky assumptions. As our discussants
have made clear, these results should be widely ap- 
plicable, and the matched pair design should be used 
whenever feasible. Fortunately, in cluster-randomized 
studies, matching clusters in pairs usually is feasi- 
ble, at least much more so than for some classes of 
unit-randomized studies. As our content analysis of 
the scholarly literature shows, there is much room 
for improvement in the practice of experimental de- 
sign; this symposium offers a clear path to saving 
research resources and unearthing far more informa- 
tion, in cluster randomized experiments, than has 
been understood heretofore. 

We thank our discussants again for their infor- 
mative contributions, and we look forward to many 
applications across fields of inquiry, as well as new 
research that pushes forward experimental design in 
ways that continue to make possible more scientifi- 
cally valid and efficient public policy evaluations. 

APPENDIX: JOURNALS INCLUDED IN 
CONTENT ANALYSIS 

We included journals in the content analysis re- 
ported in Table 1 if they published at least one 
cluster-randomized trial during the study period, 
which was 2003-2009 for political science and 2006- 
2009 for the others. The journals included are as 
follows. 

Medicine and public health: American Journal of 
Public Health, American Journal of Sports Medicine, 
Annals of Internal Medicine, British Medical Jour- 
nal, Journal of the American Medical Association, 
Lancet, Medicine & Science in Sports & Exercise, 
New England Journal of Medicine. Economics: Amer- 
ican Economic Review, Econometrica, Journal of Po- 
litical Economy, Journal of Policy Analysis and Man- 
agement. Education: American Education Research 
Journal, American Journal of College Health, Ed- 
ucational Evaluation and Policy Analysis. Political 
science: American Behavioral Scientist, American Journal of Political
Science, American Political Science Review, American Politics Research,
Annals of
the American Academy of Political and Social Sci- 
ence, Comparative Political Studies, Electoral Stud- 
ies, Journal of Politics, Political Analysis, Political 
Psychology, Political Research Quarterly, and PS: 
Political Science and Politics. 